Final Project: Boston Weather Trend
This project tends to explore the weather trend of Boston during the past 82 years, from Jan. 1, 1936 to May 1, 2018.
Specifically, we will be seeing the following analysis in this project:
- Temperature Gradient of the past 82 years by month for horizontal comparison;
- Decomposed Time series of the average temperature during the past 82 years;
- Temperature Forecast for the future five years;
- Correlations between average temperature and other weather features.
I’ll also test the following hypothesis through this project:
1. The average temperature is gradually increasing overall;
2. The precipitation is positively correlated with the temperature;
3. The temperature is negatively correlated with the wind speed;
4. The amount of snowfall is negatively correlated with the temperature.
# install.packages("tidyverse")
# install.packages(c("RColorBrewer", "dplyr", "dygraphs", "ggfortify"))
#install.packages("ggfortify")
#install.packages("rmdformats")
library(tidyverse)## ── Attaching packages ─────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1 ✔ purrr 0.2.4
## ✔ tibble 1.4.2 ✔ dplyr 0.7.4
## ✔ tidyr 0.8.0 ✔ stringr 1.3.0
## ✔ readr 1.1.1 ✔ forcats 0.2.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(RColorBrewer)
library(dplyr)
library(dygraphs)
library(ggfortify)## Warning: package 'ggfortify' was built under R version 3.4.4
library(rmdformats)0. Data Preparation
The data comes from National Center for Environmental Information: https://www.ncdc.noaa.gov.
#boston <- read.csv("https://www.ncei.noaa.gov/orders/cdo/1337764.csv")
#save(boston, file="boston.Rdata")
load("boston.Rdata")
datetxt <- boston$DATE
datetxt <- as.Date(datetxt)
boston <- add_column(boston,
YEAR = as.numeric(format(datetxt, format = "%Y")),
Month = as.numeric(format(datetxt, format = "%m")),
DAY = as.numeric(format(datetxt, format = "%d")), .after = 6) %>%
mutate(MONTH = month.abb[Month])
boston$month_ordered <- factor(boston$MONTH, levels = month.abb)
boston <- boston[,c("NAME","DATE","YEAR","Month","DAY","PRCP","SNOW","SNWD","TAVG","TMAX","TMIN","MONTH","month_ordered","AWND")]1. Temperature Gradients
Average Temperature
ggplot(data=boston, aes(x=YEAR,y=month_ordered)) +
geom_tile(aes(fill = TAVG),colour = "white") +
scale_fill_gradientn(colours=rev(brewer.pal(10,'Spectral')), na.value = "grey98",
limits = c(-20, 100)) +
theme(legend.title=element_blank(), axis.title.y=element_blank(), axis.title.x=element_blank(), plot.title = element_text(hjust = .5)) +
ggtitle("Average Temperature/ËšF of Boston from 1936-1-1 to 2018-5-1 (Based on Daily Average)")+
scale_x_continuous(breaks=seq(1936,2018,5))Max Temperature
ggplot(data=boston, aes(x=YEAR,y=month_ordered)) +
geom_tile(aes(fill = TMAX),colour = "white") +
scale_fill_gradientn(colours=rev(brewer.pal(10,'Spectral')), na.value = "grey98",
limits = c(-20, 100)) +
theme(legend.title=element_blank(),axis.title.y=element_blank(),axis.title.x=element_blank(),plot.title = element_text(hjust = .5)) +
ggtitle("Maximum Temperature/ËšF of Boston from 1936-1-1 to 2018-5-1 (Based on Daily Max)")+
scale_x_continuous(breaks=seq(1936,2018,5))Min Temperature
ggplot(data=boston, aes(x=YEAR,y=month_ordered)) +
geom_tile(aes(fill = TMIN),colour = "white") +
scale_fill_gradientn(colours=rev(brewer.pal(10,'Spectral')), na.value = "grey98",
limits = c(-20, 100)) +
theme(legend.title=element_blank(),axis.title.y=element_blank(),axis.title.x=element_blank(),plot.title = element_text(hjust = .5)) +
ggtitle("Minimum Temperature/ËšF of Boston from 1936-1-1 to 2018-5-1 (Based on Daily Min)")+
scale_x_continuous(breaks=seq(1936,2018,5))According to the temperature gradient maps above, temperature in June, July and August is obviously higher than other months, while December, January and February generally have the lowest temperature. December, 1962 stands out for its extradinarily low temperature.
Precipitation
ggplot(data=boston, aes(x=YEAR,y=month_ordered)) +
geom_tile(aes(fill = PRCP),colour = "white") +
scale_fill_gradientn(colours=brewer.pal(9,'BrBG'), na.value = "grey50",
limits = c(0, 7)) +
theme(legend.title=element_blank(),axis.title.y=element_blank(),axis.title.x=element_blank(),plot.title = element_text(hjust = .5)) +
ggtitle("Pricipitation/inch of Boston from 1936-1-1 to 2018-5-1 (Based on Daily PRCP)")+
scale_x_continuous(breaks=seq(1936,2018,5))As shown above, the overall precipitation in Boston is not too much. There are some months standing out with high precipitation, among which May, 1984 is the top 1 month with highest precipitation.
2. Time Series
monthly <- boston %>% select(month_ordered,YEAR,TAVG, PRCP, AWND, SNOW) %>%
group_by(month_ordered,YEAR) %>%
summarise(TAVG = mean(TAVG),PRCP = mean(PRCP), AWND = mean(AWND), SNOW = mean(SNOW)) %>%
arrange(YEAR)Average Temperature
Overall time series
myts <- ts(monthly$TAVG,start=c(1936,1), end=c(2018,4), frequency=12)
ts.decompose <- decompose(myts)
autoplot(ts.decompose)+ggtitle("Time Sereis of Average Tempetation")## Warning: attributes are not identical across measure variables;
## they will be dropped
## Warning: Removed 24 rows containing missing values (geom_path).
Observed
dygraph(ts.decompose$x) %>% dyRangeSelector() %>% dyOptions(drawGrid = F)Trend
dygraph(ts.decompose$trend) %>% dyRangeSelector() %>% dyOptions(drawGrid = F)Seasonal
dygraph(ts.decompose$seasonal) %>% dyRangeSelector() %>% dyOptions(drawGrid = F)Random
dygraph(ts.decompose$random) %>% dyRangeSelector() %>% dyOptions(drawGrid = F)Although the observed time series shows a seasonal temperature change with regular pattern, the trend time series tells us that the overall temperature of Boston is in a rising trend.
Precipitation
ts_prcp <- ts(na.omit(monthly$PRCP),start=c(1936,1), end=c(2018,4), frequency=12)
autoplot(decompose(ts_prcp))+ggtitle("Time Sereis of Precipitation")## Warning: attributes are not identical across measure variables;
## they will be dropped
## Warning: Removed 24 rows containing missing values (geom_path).
The time series of precipitation demonstrates a less seasonal pattern. But it has several peaks with an interval of 30 years. For instance, the year of 1955 or os has a peak, and around 30 years later, precipitation reached another peak in 1983, which is followed by the next peak in 2010.
3. Forecast
hw <- HoltWinters(myts)
predict_temp <- predict(hw, n.ahead = 24,
prediction.interval = TRUE,
level = as.numeric(0.95))
dygraph(predict_temp, main = "Predicted Average Temperature/Month") %>%
dySeries(c("lwr", "fit", "upr"), label = "Temperature") %>%
dyOptions(drawGrid = F) %>%
dyRangeSelector()The forecast of the following 24 months follows a regular pattern, with the temperature peaking in July as around 75ËšF, and January will be having the lowest temperature 31 ËšF.
4. Correlation between Temperature and other parameters
pal <- colorRampPalette(c("blue", "yellow", "red", "green"))
mycols=pal(12)Correlation between Temperature and Precipitation
ggplot(data=monthly,aes(x=TAVG,y=PRCP,color=factor(month_ordered))) +
geom_point(alpha=.5) + scale_color_manual(values=mycols) +
xlab('Temperature') + ylab('Precipitation')## Warning: Removed 1 rows containing missing values (geom_point).
ggplot(data = monthly,mapping = aes(x=TAVG,y=PRCP)) + geom_point()+
geom_smooth(method = "glm", method.args = list(family = "gaussian"))+
xlab('Temperature') + ylab('Precipitation')## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).
cor.test(monthly$TAVG, monthly$PRCP, use = "complete.obs")##
## Pearson's product-moment correlation
##
## data: monthly$TAVG and monthly$PRCP
## t = -3.8407, df = 986, p-value = 0.0001305
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.18239623 -0.05949065
## sample estimates:
## cor
## -0.1214088
From the above plots, we can come to the conclusion that the temperature is barely correlated to the precipitation of Boston. I also conduct correlation test on these two variables, the result of which shows that the r value is -0.121, a very weak negative relationship between these two variables. The precipitation among the 12 months has no obvious difference.
Correlation between Wind Speed and Temperature
ggplot(data=monthly,aes(x=AWND,y=TAVG,color=factor(month_ordered))) +
geom_point(alpha=.5) + scale_color_manual(values=mycols) +
xlab('Wind speed') + ylab('Temprature')## Warning: Removed 576 rows containing missing values (geom_point).
ggplot(data = monthly,mapping = aes(x = AWND, y = TAVG)) + geom_point()+
geom_smooth(method = "glm", method.args = list(family = "gaussian")) +
xlab('Wind speed') + ylab('Temprature')## Warning: Removed 576 rows containing non-finite values (stat_smooth).
## Warning: Removed 576 rows containing missing values (geom_point).
cor.test(monthly$TAVG, monthly$AWND, use = "complete.obs")##
## Pearson's product-moment correlation
##
## data: monthly$TAVG and monthly$AWND
## t = -16.233, df = 411, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.6804793 -0.5624534
## sample estimates:
## cor
## -0.6250255
The above plots demonstrate that there is a strongly negative correlation between the temperature and wind speed in Boston. As the wind speed increases, the average temperature decreases. Correlation test is conducted on these two variables. the result shows that the r value is -0.625, a strong negative relationship between these two variables. Besides, the second plot shows an obvious difference in terms of the wind speed among different months. High wind speed are likely to concentrate in January and February, while June, July and August have relatively low wind speed.
Correlation between Snow and Temperature
ggplot(data=monthly,aes(x=TAVG,y=SNOW,color=factor(month_ordered))) +
geom_point(alpha=.5) + scale_color_manual(values=mycols) +
xlab('Temprature') + ylab('Snow')ggplot(data = monthly,mapping = aes(x = TAVG, y = SNOW)) + geom_point()+
geom_smooth(method = "glm", method.args = list(family = "gaussian"))+
xlab('Temprature') + ylab('Snow')cor.test(monthly$TAVG, monthly$SNOW, use = "complete.obs")##
## Pearson's product-moment correlation
##
## data: monthly$TAVG and monthly$SNOW
## t = -6.9185, df = 987, p-value = 8.2e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2737319 -0.1548030
## sample estimates:
## cor
## -0.2150647
From the above plots, we can come to the conclusion that the temperature is slightly negatively correlated to the amount of snowfall in Boston. As the temperature goes up, the amount of snowfall decreases. I also conduct correlation test on these two variables, the result of which shows that the r value is -0.215, a weak negative relationship between these two variables. Besides, the second plot shows that the snowfall in Boston concentrates in three months: December, January and February.
5. Regression
regression <- lm(data = boston, TAVG ~ AWND)
regression##
## Call:
## lm(formula = TAVG ~ AWND, data = boston)
##
## Coefficients:
## (Intercept) AWND
## 64.955 -1.121
summary(regression)##
## Call:
## lm(formula = TAVG ~ AWND, data = boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.16 -12.91 0.31 13.32 40.85
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 64.9546 0.4797 135.40 <2e-16 ***
## AWND -1.1210 0.0396 -28.31 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.63 on 12538 degrees of freedom
## (17532 observations deleted due to missingness)
## Multiple R-squared: 0.06008, Adjusted R-squared: 0.06001
## F-statistic: 801.4 on 1 and 12538 DF, p-value: < 2.2e-16
AIC(regression)## [1] 106097.1
BIC(regression)## [1] 106119.4
regression2 <- lm(data = boston, TAVG ~ AWND + PRCP)
regression2##
## Call:
## lm(formula = TAVG ~ AWND + PRCP, data = boston)
##
## Coefficients:
## (Intercept) AWND PRCP
## 65.097 -1.155 2.071
summary(regression2)##
## Call:
## lm(formula = TAVG ~ AWND + PRCP, data = boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47.85 -12.92 0.11 13.17 41.22
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 65.09705 0.48030 135.535 < 2e-16 ***
## AWND -1.15494 0.04024 -28.698 < 2e-16 ***
## PRCP 2.07055 0.45642 4.536 5.77e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.62 on 12535 degrees of freedom
## (17534 observations deleted due to missingness)
## Multiple R-squared: 0.06169, Adjusted R-squared: 0.06154
## F-statistic: 412 on 2 and 12535 DF, p-value: < 2.2e-16
AIC(regression2)## [1] 106060
BIC(regression2)## [1] 106089.8
regression3 <- lm(data = boston, TAVG ~ AWND + PRCP + SNOW)
regression3##
## Call:
## lm(formula = TAVG ~ AWND + PRCP + SNOW, data = boston)
##
## Coefficients:
## (Intercept) AWND PRCP SNOW
## 64.038 -1.041 3.840 -3.522
summary(regression3)##
## Call:
## lm(formula = TAVG ~ AWND + PRCP + SNOW, data = boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.299 -12.665 -0.118 12.975 66.685
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 64.0380 0.4745 134.966 <2e-16 ***
## AWND -1.0409 0.0399 -26.087 <2e-16 ***
## PRCP 3.8401 0.4560 8.421 <2e-16 ***
## SNOW -3.5219 0.1654 -21.290 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.32 on 12534 degrees of freedom
## (17534 observations deleted due to missingness)
## Multiple R-squared: 0.09444, Adjusted R-squared: 0.09422
## F-statistic: 435.7 on 3 and 12534 DF, p-value: < 2.2e-16
AIC(regression3)## [1] 105616.6
BIC(regression3)## [1] 105653.8
The results of regression listed above tell us that all the parameters, namely wind speed, precipitation and snowfall, have a significant impact on the change of temperature in Boston. The limit of this project is that the data is not complete. For the future studies, I would recommend researchers to explore more weather parameters and their influence on the temperature.